Using String Kernel for Document Clustering

Authors

Qingwei Shi

Xiaodong Qiao

Abstract

In this paper, we present a string-kernel-based method for document clustering. Documents are viewed as sequences of strings, and document similarity is calculated by the kernel function. Based on these similarities, a spectral clustering algorithm is used to group the documents. Experimental results show that the string kernel method outperforms the standard k-means algorithm on the Reuters-21578 dataset.
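The similarity step described in the abstract can be illustrated with a small sketch. The abstract does not say which string kernel the authors use, so the example below assumes a p-spectrum kernel (counts of shared contiguous substrings of length p), normalized so that every document has similarity 1 with itself. The function names, the choice p = 3, and the toy documents are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text, p=3):
    """Count every contiguous substring of length p in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a, b, p=3):
    """p-spectrum kernel: inner product of the two substring-count vectors.

    Assumption: this kernel stands in for the paper's unspecified string kernel.
    """
    fa, fb = spectrum_features(a, p), spectrum_features(b, p)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(count * fb[s] for s, count in fa.items())


def normalized_kernel(a, b, p=3):
    """Kernel value scaled into [0, 1]; returns 0.0 when nothing is shared."""
    kab = spectrum_kernel(a, b, p)
    if kab == 0:
        return 0.0
    return kab / sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))


def similarity_matrix(docs, p=3):
    """Pairwise normalized kernel values for a list of documents."""
    return [[normalized_kernel(x, y, p) for y in docs] for x in docs]


# Toy documents (hypothetical, not from Reuters-21578).
docs = [
    "oil prices rise",
    "oil prices fall",
    "wheat harvest grows",
]
K = similarity_matrix(docs)
for row in K:
    print([round(value, 3) for value in row])
```

A matrix like `K` is the kind of input a spectral clustering routine expects as a precomputed affinity, for example scikit-learn's `SpectralClustering(affinity="precomputed")`. Whether the paper's pipeline matches this setup is not stated in the abstract.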

Similar resources

Text Clustering with String Kernels in R

We present a package which provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package and the kernlab R package we explore the use of kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods to a more traditional clustering techniqu...

Loose Phrase String Kernels

When representing textual documents by feature vectors for the purposes of further processing (e.g. for categorization, clustering, or visualization), one possible representation is based on “loose phrases” (also known as “proximity features”). This is a generalization of n-grams: a loose phrase is considered to appear in a document if all the words from the phrase occur sufficiently close to e...

Composite Kernel Optimization in Semi-Supervised Metric

Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated a considerable interest in the to...

Utilizing the Structure and Data Information for XML Document Clustering

This paper reports on the experiments and results of a clustering approach used in the INEX 2008 Document Mining Challenge. The clustering approach utilizes both the structure and the content information of the XML documents in the Wikipedia collection. The content of the XML documents is measured using the latent semantic kernel (LSK). A well-known problem with the construction of latent seman...

Decision Making with Uncertainty and Data Mining

Complex networks and networked data mining; In-depth data mining and its application in stock market; Relevance of counting in data mining tasks; Term graph model for text classification; A latent usage approach for clustering Web transaction and building user profile; Mining quantitative association rules on overlapped intervals; An approach to mining local causal...

Publication date: 2011
